Improvement of a physiological articulatory model for synthesis of vowel sequences
Authors
Abstract
A 3D physiological articulatory model has been constructed based on volumetric MRI data obtained from a male speaker. The model is driven by muscles according to a target-dependent activation pattern. In this study, we improved the dynamic characteristics of the model to produce higher sound quality for vowel sequences. The dynamic characteristics of the articulatory organs were investigated using X-ray microbeam data of vowel sequences and vowel-consonant-vowel (VCV) sequences from 11 Japanese speakers. It was found that the velocity of the tongue tip is about 60% faster in vowel-to-consonant transitions than in vowel-to-vowel transitions, while the velocities of the tongue dorsum and jaw were independent of the sequence type. The reaction time of the articulators, measured from maximal acceleration to maximal velocity, is about 40% shorter in vowel-to-consonant transitions than in vowel-to-vowel transitions. To apply the improved model to speech analysis, articulatory targets were estimated for the vowels in vowel sequences using an analysis-by-synthesis (AbS) method and then used to generate the vocal tract shapes for the vowel sequences. The resulting vocal tract shapes and synthetic sounds were compared with speech sounds and articulatory data from the target speaker. The results showed that the model demonstrates plausible dynamic characteristics of articulatory movement in producing vowel sequences. The simulation error was about 2.5% for the formants and 0.2 cm for the observation points of the vocal tract.

1. CONSTRUCTION AND IMPROVEMENT OF THE PHYSIOLOGICAL ARTICULATORY MODEL

A 3D physiological articulatory model driven by muscle contraction has been developed for human-mimetic speech synthesis. The articulatory model was constructed based on MRI data obtained from a male speaker. The entire model consists of the tongue, the jaw, and the vocal tract wall.

1.1. Construction of a Physiological Articulatory Model

The tongue shapes were extracted from volumetric MRI data in the midsagittal and parasagittal planes. The basic structure of the tongue tissue model roughly replicates the fiber orientation of the genioglossus muscle. The central part of the tongue, including the genioglossus, is represented by a 2-cm-thick layer bounded by three sagittal planes. Each plane is divided into six sections at nearly equal intervals in the anterior-posterior direction and ten sections along the tongue surface. The 3D tongue model is constructed by connecting the section nodes in the midsagittal plane to the corresponding nodes in the left and right planes using viscoelastic springs. This model is capable of forming the midsagittal groove and the side airway, which are essential behaviors of the tongue in producing vowels and consonants. The jaw and hyoid bone are modeled as rigid bodies that can rotate and translate. Outlines of the vocal tract wall were also extracted from MRI data in the midsagittal and parasagittal planes (0.7 and 1.4 cm from the midsagittal plane on the right side). Assuming that the left and right sides are symmetric, 3D surface shells of the vocal tract wall and the mandibular symphysis wall were constructed from the MRI-derived outlines (see [1, 2] for details). At the present stage, the lips and the velum are not modeled physiologically. The lips are represented by a short tube with a given length and cross-sectional area, and the state of the velum is determined by the opening area of the naso-pharyngeal port. These parameters are treated as acoustic parameters in the speech synthesis stage.
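The mesh topology described above (three sagittal planes, six anterior-posterior sections, ten sections along the tongue surface, with midsagittal nodes tied to the parasagittal planes by viscoelastic springs) can be summarized as a simple data structure. The following Python sketch is only an illustration of that topology under stated assumptions; the node counts per direction, stiffness, and damping values are placeholders, not the parameters of the original model.

```python
# Minimal sketch of the lateral spring topology of the 3D tongue mesh (Section 1.1).
# Assumptions: six sections give seven node rows anterior-posteriorly, ten sections
# give eleven nodes along the surface; stiffness/damping values are placeholders.
from dataclasses import dataclass
import itertools

N_PLANES = 3      # midsagittal plus left and right parasagittal planes
N_ANTPOST = 6     # sections in the anterior-posterior direction
N_SURFACE = 10    # sections along the tongue surface

@dataclass
class Spring:
    node_a: tuple     # (plane index, anterior-posterior index, surface index)
    node_b: tuple
    stiffness: float  # placeholder value
    damping: float    # placeholder value

def build_lateral_springs(k: float = 100.0, c: float = 1.0):
    """Connect each midsagittal node to the corresponding nodes in the
    left and right parasagittal planes with viscoelastic springs."""
    springs = []
    mid = 1  # index of the midsagittal plane (0 = left, 2 = right)
    for i, j in itertools.product(range(N_ANTPOST + 1), range(N_SURFACE + 1)):
        for lateral in (0, 2):
            springs.append(Spring((mid, i, j), (lateral, i, j), k, c))
    return springs

if __name__ == "__main__":
    springs = build_lateral_springs()
    print(f"{len(springs)} lateral viscoelastic springs")  # 2 * 7 * 11 = 154
```

In the actual model, additional springs within each plane and the anatomical node coordinates extracted from the MRI data define the full viscoelastic mesh; the sketch only shows how the three planes are coupled.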
1.2. Control Strategy of the Model

The model is driven by twelve muscles for the tongue and eight muscles for the jaw. Muscle activation signals are estimated using a target-based control strategy. Three control points are used: the tongue tip, the tongue dorsum, and the jaw. The control point for the tongue tip is the apex of the tongue in the midsagittal plane. The control point for the dorsum is the weighted average position of the highest three points of the initial configuration in the midsagittal plane. The control point for the jaw is 0.5 cm inferior to the tip of the mandibular incisor. The articulatory target is the coordinate of the final position of each control point during stable phonation. The target-based control strategy generates muscle activation signals according to a given target for each control point and feeds the activation signals into the muscles to drive the model. For this purpose, a muscle workspace is constructed for each control point to establish a relationship between the target and the activation signals [1]. The muscle workspace consists of muscle force vectors that correspond to the displacement of the control point when the muscles contract. Since the orientation of the muscles varies with articulatory movement, the muscle force vectors must be adjusted to the locations of the control points. Therefore, a set of muscle force vectors is calculated for the control points of the tongue in four positions, corresponding to the tongue shapes of the rest posture and the three extreme vowels /a/, /i/, and /u/ [2]. For the jaw muscle vectors, positions are chosen in the rest position and a […]
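The text above describes a muscle workspace that relates a control-point target to muscle activation signals but does not spell out the mapping itself. The sketch below illustrates one plausible realization, assuming that the activation vector is obtained as the non-negative least-squares combination of the stored muscle force (displacement) vectors that best reproduces the desired displacement of the control point. The muscle labels and workspace numbers are hypothetical placeholders, not the model's actual workspace.

```python
# Illustrative mapping from a control-point target to muscle activations.
# Assumption: activations are the non-negative least-squares combination of
# the workspace vectors that best reproduces the desired displacement.
# Muscle names and numerical vectors are placeholders.
import numpy as np
from scipy.optimize import nnls

# Hypothetical workspace for one control point (e.g. the tongue dorsum):
# each column is the displacement (dx, dy in cm) produced by unit activation
# of one muscle.
MUSCLES = ["GGa", "GGp", "STY", "HYO"]           # hypothetical muscle subset
WORKSPACE = np.array([[ 0.3, -0.4,  0.1,  0.0],  # dx per unit activation
                      [-0.2,  0.5,  0.4, -0.3]]) # dy per unit activation

def activations_for_target(current_xy, target_xy):
    """Return non-negative activation levels that move the control point
    from its current position toward the articulatory target."""
    desired = np.asarray(target_xy) - np.asarray(current_xy)
    act, residual = nnls(WORKSPACE, desired)
    return dict(zip(MUSCLES, act)), residual

if __name__ == "__main__":
    acts, res = activations_for_target(current_xy=(0.0, 0.0),
                                       target_xy=(-0.2, 0.6))
    print(acts, f"residual={res:.3f}")
```

Because the muscle force vectors are recomputed for the rest posture and the extreme vowels /a/, /i/, and /u/, such a mapping would in practice select or interpolate the workspace according to the current tongue configuration before solving for the activations.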
Related Articles
Model-Based Reproduction of Articulatory Trajectories for Consonant-Vowel Sequences
We present a novel quantitative model for the generation of articulatory trajectories based on the concept of sequential target approximation. The model was applied for the detailed reproduction of movements in repeated consonant-vowel syllables measured by electromagnetic articulography (EMA). The trajectories for the constrictor (lower lip, tongue tip, or tongue dorsum) and the jaw were repro...
Vowel Creation by Articulatory Control in HMM-based Parametric Speech Synthesis
Hidden Markov model (HMM)-based parametric speech synthesis has become a mainstream speech synthesis method in recent years. This method is able to synthesise highly intelligible and smooth speech sounds. In addition, it makes speech synthesis far more flexible compared to the conventional unit selection and waveform concatenation approach. Several adaptation and interpolation methods have been...
Study on the Anticipatory Coarticulatory Effect of Chinese Disyllabic Sequences
In this study, the Vowel-to-Vowel (V-to-V) coarticulatory effect in the Vowel-Consonant-Vowel (VCV) sequences is investigated, and the F2 offset value of the first vowel is analyzed. Results show that, in the trans-segment context, anticipatory coarticulation exists in Chinese. Due to high articulatory strength of aspirated obstruents, in the context of subsequent vowel /i/, the V1 F2 offset va...
Modeling Consonant-Vowel Coarticulation for Articulatory Speech Synthesis
A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for the generation of highly natural and intelligible speech. This includes modeling coarticulation, i.e., the context-dependent variation of the articulatory and acoustic realization of phonemes, especially of consonants. Here we propose a method to simulate the contex...
VCV Synthesis Using Task Dynamics to Animate a Factor-Based Articulatory Model
This paper presents an initial architecture for articulatory synthesis which combines a dynamical system for the control of vocal tract shaping with a novel MATLAB implementation of an articulatory synthesizer. The dynamical system controls a speaker-specific vocal tract model derived by factor analysis of mid-sagittal real-time MRI data and provides input to the articulatory synthesizer, which...
Journal:
Volume/Issue:
Pages: -
Publication year: 2000